Learning complex cell invariance from natural videos: a plausibility proof


Timothee  Masquelier,  Thomas  Serre,  Simon  Thorpe  and  Tomaso  Poggio 


Abstract 


One of the most striking features of the cortex is its ability to wire itself. Understanding how the visual
cortex wires up through development and how visual experience refines connections into adulthood is a
key question for neuroscience. While computational models of the visual cortex are becoming increasingly
detailed  (Riesenhuber  and  Poggio,  1999;  Giese  and  Poggio,  2003;  Deco  and  Rolls,  2004,  2005;  Serre  et  al., 
2005a;  Rolls  and  Stringer,  2006;  Berzhanskaya  et  al.,  2007;  Masquelier  and  Thorpe,  2007),  the  question  of 
how  such  architecture  could  self-organize  through  visual  experience  is  often  overlooked. 

Here we focus on the class of hierarchical feedforward models of the ventral stream of the visual cortex
(Fukushima, 1980; Perrett and Oram, 1993; Wallis and Rolls, 1997; Mel, 1997; VanRullen et al., 1998;
Riesenhuber and Poggio, 1999; Ullman et al., 2002; Hochstein and Ahissar, 2002; Amit and Mascaro, 2003;
Wersing and Koerner, 2003; Serre et al., 2005a; Masquelier and Thorpe, 2007; Serre et al., 2007), which
extend the classical simple-to-complex cells model by Hubel and Wiesel (1962) to extra-striate areas, and
have been shown to account for a host of experimental data. Such models assume two functional classes of
simple and complex cells with specific predictions about their respective wiring and resulting functionalities.

In  these  networks,  the  issue  of  learning,  especially  for  complex  cells,  is  perhaps  the  least  well  understood. 
In  fact,  in  most  of  these  models,  the  connectivity  between  simple  and  complex  cells  is  not  learned  but 
rather  hard-wired.  Several  algorithms  have  been  proposed  for  learning  invariances  at  the  complex  cell 
level  based  on  a  trace  rule  to  exploit  the  temporal  continuity  of  sequences  of  natural  images  (e.g.,  (Foldiak, 
1991;  Wallis  and  Rolls,  1997;  Wiskott  and  Sejnowski,  2002;  Einhauser  et  al.,  2002;  Spratling,  2005)),  but  very 
few  can  learn  from  natural  cluttered  image  sequences. 

Here we propose a new variant of the trace rule that only reinforces the synapses between the most active
cells, and therefore can handle cluttered environments. The algorithm has so far been developed and
tested at the level of V1-like simple and complex cells: we verified that Gabor-like simple cell selectivity
could emerge from competitive Hebbian learning (see also (Delorme et al., 2001; Einhauser et al., 2002;
Guyonneau, 2006)). In addition, we show how the modified trace rule allows the subsequent complex
cells to learn to selectively pool over simple cells with the same preferred orientation but slightly different
positions, thus increasing their tolerance to the precise position of the stimulus within their receptive fields.
1 Introduction

Learning is arguably the key to understanding intelligence (Poggio and Smale, 2003). One of the most
striking features of the cortex is its ability to wire itself. Understanding how the visual cortex wires up
through development and how plasticity refines connections into adulthood is likely to give necessary
constraints to computational models of visual processing. Surprisingly, there have been relatively few
computational studies (Perrett et al., 1984; Foldiak, 1991; Hietanen et al., 1992; Wallis et al., 1993;
Wachsmuth et al., 1994; Wallis and Rolls, 1997; Stringer and Rolls, 2000; Rolls and Milward, 2000;
Wiskott and Sejnowski, 2002; Einhauser et al., 2002; Spratling, 2005) that have tried to address the
mechanisms by which learning and plasticity may shape the receptive fields (RFs) of neurons in the
visual cortex.

Here we study biologically plausible mechanisms for the learning of both selectivity and invariance of
cells in the primary visual cortex (V1). We focus on a specific class of models of the ventral stream of
the visual cortex, the feedforward hierarchical models of visual processing (Fukushima, 1980; Perrett and
Oram, 1993; Wallis and Rolls, 1997; Mel, 1997; VanRullen et al., 1998; Riesenhuber and Poggio, 1999;
Ullman et al., 2002; Hochstein and Ahissar, 2002; Amit and Mascaro, 2003; Wersing and Koerner, 2003;
Serre et al., 2005a; Masquelier and Thorpe, 2007; Serre et al., 2007), which extend the classical
simple-to-complex cells model by Hubel and Wiesel (1962) (see Box 1) and have been shown to account
for a host of experimental data.

We have used a specific implementation of such feedforward hierarchical models (Riesenhuber and Poggio,
1999; Serre et al., 2005b, 2007), which makes predictions about the nature of the computations and the
specific wiring of simple and complex units, denoted S1 and C1 units respectively. Learning in higher
stages of the model will be addressed in future work. We show that with simple biologically plausible
learning rules, these two classes of cells can be learned from natural real-world videos with no
supervision. In particular, we verified that the Gabor-like selectivity of S1 units could emerge from
competitive Hebbian learning (see also (Delorme et al., 2001; Einhauser et al., 2002; Guyonneau, 2006)).
In addition, we propose a new mechanism, which suggests how the specific pooling from S1 to C1 units
could self-organize by passive exposure to natural input video sequences. We discuss the computational
requirements for such unsupervised learning to take place and make specific experimental predictions.

1.1  Evidence  for  learning  and  plasticity  in  the  visual 
cortex 

In the developing animal, 'rewiring' experiments (see (Horng and Sur, 2006) for a recent review), which
re-route inputs from one sensory modality to an area normally processing a different modality, have now
established that visual experience can have a pronounced impact on the shaping of cortical networks.
How plastic the adult visual cortex is remains, however, a matter of debate.

From the computational perspective, it is very likely that learning may occur in all stages of the visual
cortex. For instance, if learning a new task involves high-level object-based representations, learning is
likely to occur high up in the hierarchy, at the level of IT or PFC. Conversely, if the task to be learned
involves the fine discrimination of orientations, as in perceptual learning tasks, changes are more likely
to occur in lower areas at the level of V1, V2 or V4 (see (Ghose, 2004) for a review). It is also very
likely that changes in higher cortical areas should occur at faster time scales than changes in lower
areas.

By now there have been several reports of plasticity at all levels of the ventral stream of the visual
cortex (see (Kourtzi and DiCarlo, 2006)), i.e., both in higher areas like PFC (Rainer and Miller, 2000;
Freedman et al., 2003; Pasupathy and Miller, 2005) and IT (see for instance (Logothetis et al., 1995;
Rolls, 1995; Kobatake et al., 1998; Booth and Rolls, 1998; Erickson et al., 2000; Sigala and Logothetis,
2002; Baker et al., 2002; Jagadeesh et al., 2001; Freedman et al., 2006)) in monkeys or the LOC in
humans (Dolan et al., 1997; Gauthier et al., 1999; Kourtzi et al., 2005; Op de Beeck et al., 2006; Jiang
et al., 2007). Plasticity has also been reported in intermediate areas like V4 (Yang and Maunsell, 2004;
Rainer et al., 2004) or even lower areas like V1 (Singer et al., 1982; Karni and Sagi, 1991; Yao and Dan,
2001; Schuett et al., 2001; Crist et al., 2001), although its extent and functional significance are still
under debate (Schoups et al., 2001; Ghose et al., 2002; DeAngelis et al., 1995).

At the cellular level, supervised learning procedures to validate Hebb's covariance hypothesis in vivo
have also been proposed. The covariance hypothesis predicts that a cell's relative preference between two
stimuli could be displaced towards one of them by pairing its presentation with imposed increased
responsiveness (through iontophoresis). Indeed, it was shown possible to durably change some cells' RF
properties in cat primary visual cortex, such as ocular dominance, orientation preference, interocular
orientation disparity and ON or OFF dominance, both during the critical developmental period (Fregnac
et al., 1988) and in adulthood (McLean and Palmer, 1998; Fregnac and Shulz, 1999). More recently, a
similar procedure was used to validate Spike-Timing-Dependent Plasticity in developing rat visual cortex
(Meliza and Dan, 2006) using in vivo whole-cell recording.

Altogether  the  evidence  suggests  that  learning  plays 
a  key  role  in  determining  the  wiring  and  the  synaptic 
weights  of  cells  in  the  visual  cortex. 
Box 1: The Hubel & Wiesel hierarchical model of primary visual cortex.


Following their work on striate cortex, Hubel & Wiesel described a hierarchy of cells in the primary
visual cortex: At the bottom of the hierarchy, the radially symmetric cells are like LGN cells and respond
best to small spots of light. Second, the simple cells do not respond well to spots of light and require
bar-like (or edge-like) stimuli at a particular orientation, position and phase (i.e., white bar on a black
background or dark bar on a white background). In turn, the complex cells are also selective for bars at
a particular orientation but they are insensitive to both the location and the phase of the bar within
their receptive fields (RFs). At the top of the hierarchy, the hypercomplex cells not only respond to bars
in a position and phase invariant way, just like complex cells, but are also selective for bars of a
particular length (beyond a certain length their response starts decreasing).

Hubel & Wiesel suggested that such increasingly complex and invariant object representations could be
progressively built by integrating convergent inputs from lower levels. For instance, as illustrated in
Fig. 1 (reproduced from (Hubel and Wiesel, 1962)), position invariance at the complex cell level could be
obtained by pooling over simple cells at the same preferred orientation but at slightly different
positions.


Figure 1: The Hubel & Wiesel hierarchical model for building complex cells from simple cells. Reproduced
from (Hubel and Wiesel, 1962).


1.2  Simple  and  complex  cell  modeling 

The key computational issue in object recognition is the specificity-invariance trade-off: recognition
must be able to finely discriminate between different objects or object classes while at the same time be
tolerant to object transformations such as scaling, translation, illumination, changes in viewpoint,
changes in clutter, as well as non-rigid transformations (such as a change of facial expression) and, for
the case of categorization, also to variations in shape within a class. Thus the main computational
difficulty of object recognition is achieving a trade-off between selectivity and invariance. Extending
the hierarchical model by (Hubel and Wiesel, 1962) (see Box 1) to extrastriate areas and based on
theoretical considerations, Riesenhuber and Poggio (1999) speculated that only two functional classes of
units may be necessary to achieve this trade-off:

The simple S units perform a TUNING operation over their afferents to build object-selectivity. The simple
S units receive convergent inputs from retinotopically organized units tuned to different preferred
stimuli and combine these subunits with a bell-shaped tuning function, thus increasing object selectivity
and the complexity of the preferred stimulus (see (Serre et al., 2005a) for details).

The analog of the TUNING operation in computer vision is the template matching operation between an
input image and a stored representation. As discussed in (Poggio and Bizzi, 2004), neurons with a
Gaussian-like bell-shape tuning are prevalent across cortex. For instance, simple cells in V1 exhibit a
Gaussian tuning around their preferred orientation (Hubel and Wiesel, 1962), and even cells in
inferotemporal cortex are typically tuned around a particular view of their preferred object (Logothetis
et al., 1995; Booth and Rolls, 1998). From the computational point of view, Gaussian-like tuning profiles
may be key in the generalization ability of cortex, and networks that combine the activity of several
units tuned with a Gaussian profile to different training examples have proved to be a powerful learning
scheme (Poggio and Girosi, 1990; Poggio and Smale, 2003).
Figure 2: MAX-like operation from a complex cell in area 17 of the cat. Illustrated is the response of a complex cell to the
simultaneous presentation of two bars (see (Lampl et al., 2004) for details). A: average membrane potential measured from the
response of the cell to bars of the optimal orientation. Black traces are the responses to dark bars (OFF responses) and gray traces
are the responses to bright bars (ON responses). B: intensity plots obtained from the mean potentials. C: cell responses to each of
the selected bars shown in B by thick lines around the rectangles. Lines in the 1st row and 1st column panels are the averaged
responses to the presentation of a single bar, and the shaded area shows the mean (±SE). The inner panels present the response
of the cell to the simultaneous presentation of the 2 bars whose positions are given by the corresponding column and row (gray
traces), the responses to the 2 stimuli presented individually (thin black traces) and the linear sum of the 2 individual responses
(thick black traces). Modified from (Lampl et al., 2004).


The complex C units receive convergent inputs from retinotopically organized S units tuned to the same
preferred stimuli but at slightly different positions and scales, and combine them with a MAX-like
operation, thereby introducing tolerance to scale and translation. MAX functions are commonly used in
signal processing (e.g., for selecting peak correlations) to filter noise out. The existence of a MAX
operation in visual cortex was predicted by (Riesenhuber and Poggio, 1999) from theoretical arguments
(and limited experimental evidence (Sato, 1989)) and was later supported experimentally in V4 (Gawne and
Martin, 2002) and in V1 at the complex cell level (Lampl et al., 2004). Note that a soft-max close to an
average may be sufficient for robust and invariant object recognition (Serre and Poggio, 2005) and seems
to account for a significant proportion of complex cells (Finn and Ferster, 2007). Fig. 2 (reproduced
from (Lampl et al., 2004)) illustrates how a complex cell may combine the response of oriented
retinotopically organized subunits (presumably simple cells) at the same preferred orientation with a
MAX pooling mechanism.

Computational implementation of the Hubel & Wiesel model: In this work we use a static idealized
approximation to describe the response of simple and complex units. As described in (Knoblich et al.,
2007; Kouh and Poggio, 2007; Serre et al., 2005a), both operations can be carried out by a divisive
normalization followed by weighted sum and rectification. Normalization mechanisms (also commonly
referred to as gain control) in this case can be achieved by a feedforward (or recurrent) shunting
inhibition (Torre and Poggio, 1978; Reichardt et al., 1983; Carandini and Heeger, 1994).
Box 2: Computational implementation of the Hubel & Wiesel model.


We denote by $(x_1, x_2, \ldots, x_n)$ the set of inputs to a unit and by $w = (w_1, \ldots, w_n)$ their respective input strengths. For
a complex unit, the inputs $x_j$ are retinotopically organized and selected from an $m \times m$ grid of
afferents with the same selectivity (e.g., for a horizontal complex cell, the subunits are all tuned to a horizontal bar
but at slightly different positions and spatial frequencies). For a simple unit, the subunits are also retinotopically
organized (selected from an $m \times m$ grid of possible afferents). But, in contrast with complex units, the subunits of
a simple cell could in principle have different selectivities to increase the complexity of the preferred stimulus.
Mathematically, both the TUNING operation and the MAX operation at the simple and complex unit levels can be
well approximated by the following equation:


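$$ y = \frac{\sum_{j=1}^{n} w_j \, x_j^{p}}{k + \left( \sum_{j=1}^{n} x_j^{q} \right)^{r}} $$

where $y$ is the output of the unit, $k \ll 1$ is a constant to avoid zero-divisions and $p$, $q$ and $r$ represent the static
non-linearities in the underlying neural circuit.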

Such non-linearities may correspond to different regimes on the f-I curve of the presynaptic neurons such that
different operating ranges provide different degrees of non-linearity (from near-linearity to steep non-linearity).
An extra sigmoid transfer function on the output, $g(y) = 1/(1 + \exp(\alpha (y - \beta)))$, controls the sharpness of the unit
response.

By adjusting these non-linearities, the equation above can better approximate a MAX or a TUNING function:

• When $p \lesssim qr$, the unit approximates a Gaussian-like TUNING, i.e., its response $y$ will have a peak around
some value proportional to the vector of input strengths $w = (w_1, \ldots, w_n)$. For instance, when $p = 1$, $q = 2$ and $r = 1/2$,
the circuits perform a normalized dot-product with an L2 norm, which with the addition of a bias term may
approximate a Gaussian function very closely (see (Kouh and Poggio, 2007; Serre et al., 2005a) for details).

• When $p \approx q + 1$ ($w_j \approx 1$), the unit implements a soft-max and approximates a MAX function very closely for
larger $q$ values (see (Yu et al., 2002); the quality of the approximation also increases as the inputs become more
dissimilar). For instance, $r \approx 1$, $p \approx 1$, $q \approx 2$ gives a good approximation of the MAX (see (Kouh and Poggio,
2007; Serre et al., 2005a) for details).
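
The two regimes of the equation above can be checked numerically. The following is a minimal Python sketch, not the authors' implementation: the function name `unit_response`, the value of `k` and the test vectors are illustrative assumptions, and the MAX-like example uses exponents that satisfy the stated condition p = q + 1 with r = 1 and unit weights.

```python
import numpy as np

def unit_response(x, w, p, q, r, k=1e-6):
    """Static unit response: sum_j w_j x_j^p / (k + (sum_j x_j^q)^r)."""
    x = np.asarray(x, dtype=float)
    w = np.asarray(w, dtype=float)
    return np.sum(w * x**p) / (k + np.sum(x**q) ** r)

# TUNING regime (p=1, q=2, r=1/2): normalized dot product with an L2 norm.
# The weight vector below is an arbitrary unit-norm example.
w = np.array([0.6, 0.8, 0.0])
matched = unit_response(2.0 * w, w, p=1, q=2, r=0.5)        # input parallel to w
mismatched = unit_response([0.0, 0.0, 1.0], w, p=1, q=2, r=0.5)  # orthogonal input

# MAX-like regime (p = q + 1, r = 1, w_j = 1): a soft-max that sharpens as q grows.
x = np.array([0.2, 0.9, 0.5])
ones = np.ones_like(x)
soft_q2 = unit_response(x, ones, p=3, q=2, r=1)
soft_q8 = unit_response(x, ones, p=9, q=8, r=1)

print(matched, mismatched)          # tuning response for matched vs. orthogonal input
print(soft_q2, soft_q8, x.max())    # soft-max outputs next to the true maximum
```

With these inputs the TUNING unit responds close to 1 for an input parallel to w and gives 0 for an orthogonal one, while the soft-max output moves toward the largest input as q grows.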


The detailed mathematical formulation of the two operations is given in Box 2. There are plausible local
circuits (Serre et al., 2005a) implementing the two key operations within the time constraints of the
experimental data (Perrett et al., 1992; Keysers et al., 2001; Hung et al., 2005) based on small local
populations of spiking neurons firing probabilistically in proportion to the underlying analog value
(Smith and Lewicki, 2006) and on shunting inhibition (Grossberg, 1973). Other possibilities may involve
spike timing in individual neurons (Masquelier and Thorpe, 2007) (see (VanRullen et al., 2005) for a
recent review). A complete description of the two operations, a summary of the evidence as well as
plausible biophysical circuits to implement them can be found in (Knoblich et al., 2007; Serre et al.,
2005a).

While there exists at least partial evidence for the existence of both Gaussian TUNING and MAX-like
operations (see earlier), the question of how the specific wiring of simple and complex cells could
self-organize during development and how their selectivity could be shaped through visual experience is
open. In the next section, we review related work and speculate on computational mechanisms that could
underlie the development of such circuits.

1.3  On  learning  simple  and  complex  cells 

Here we speculate that correlations play a key role in learning. Beyond the Hebbian doctrine, which says
that 'neurons that fire together wire together', we suggest that correlation in the inputs of neurons
could explain the wiring of both simple and complex cells. As emphasized by several authors, statistical
regularities in natural visual scenes may provide critical cues to the visual system to solve specific
tasks (Richards et al., 1992; Knill and Richards, 1996; Callaway, 1998; Coppola et al., 1998) or even
provide a teaching signal (Barlow, 1961; Sutton and Barto, 1981; Foldiak, 1991) for learning with no
supervision. More specifically, we suggest that the wiring of the simple S units depends on learning
correlations in space while the wiring of the C units depends on learning correlations in time (Serre
et al., 2005a).

The wiring of the simple units corresponds to learning correlations between inputs at the same time
(i.e., for simple S1 units in V1, the bar-like arrangements of LGN inputs, and beyond V1, more elaborate
arrangements of bar-like subunits, etc.). This corresponds to learning which combinations of features
appear most frequently in images. That is, a simple unit has to detect conjunctions of inputs (i.e., sets
of inputs that are consistently co-active), and to become selective to these patterns. This is roughly
equivalent to learning a dictionary of image patterns that appear with higher probability.

This is a very simple and natural assumption. Indeed it follows a long tradition of researchers that have
suggested that the visual system, through visual experience and evolution, may be adapted to the
statistics of its natural environment (Attneave, 1954; Barlow, 1961; Atick, 1992; Ruderman, 1994) (see
also (Simoncelli and Olshausen, 2001) for a review). For instance, Attneave (1954) proposed that the goal
of the visual system is to build an efficient representation of the visual world and Barlow (1961)
emphasized that neurons in cortex try to reduce the redundancy present in the natural environment.

This type of learning can be done with a Hebbian learning rule (von der Malsburg, 1973; Foldiak, 1990).
Here we used a slightly modified Hebb rule, which has the advantage of keeping the synaptic weights
bounded, while remaining a local learning rule (see Section 4, Eq. 5). At the same time, a mechanism is
necessary to prevent all the simple units in a given cortical column from learning the same pattern. Here
we used hard competition of the 1-Winner-Take-All form (see (Rolls and Deco, 2002) for evidence). In the
algorithm we describe below, at each iteration and within each hypercolumn only the most activated unit
is allowed to learn (but it will do so if and only if its activity is above a threshold, see Section 4.3).
In the cortex such a mechanism could be implemented by short range lateral inhibition.
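
The competitive scheme described above can be sketched as follows. This is a minimal illustration under stated assumptions: Oja's rule is used here as a stand-in for a bounded, local Hebbian update (the rule actually used in this work is the one referred to in Section 4, which is not reproduced here), and the function name `wta_hebbian_step`, the learning rate and the threshold value are hypothetical.

```python
import numpy as np

def wta_hebbian_step(W, x, lr=0.05, threshold=0.1):
    """One learning step for the units of a single hypercolumn.

    W: (n_units, n_inputs) float weight matrix, updated in place.
    x: (n_inputs,) input vector. Returns the index of the winning unit.
    """
    y = W @ x                       # activity of every unit
    winner = int(np.argmax(y))      # hard 1-Winner-Take-All competition
    if y[winner] > threshold:       # the winner learns only above threshold
        # Oja-style update (assumed stand-in): Hebbian term y*x minus a
        # decay term y^2*w that keeps the weight vector bounded.
        W[winner] += lr * y[winner] * (x - y[winner] * W[winner])
    return winner

# Two units, three inputs; the same pattern is presented repeatedly.
W = np.array([[0.6, 0.2, 0.1],
              [0.1, 0.3, 0.7]])
x = np.array([1.0, 0.0, 0.0])
loser_before = W[1].copy()
for _ in range(500):
    winner = wta_hebbian_step(W, x)

print(winner)   # index of the unit that won the competition
print(W)        # the winner's weights align with x, the other unit is unchanged
```

In this toy run only the winning unit changes: its weight vector converges toward the presented pattern and stays bounded, while the losing unit keeps its initial weights and remains free to learn a different pattern.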

Networks with anti-Hebbian horizontal connections have also been proposed (Foldiak, 1990). While such
networks could, in principle, remove redundancy more efficiently, horizontal connections are unlikely to
play a critical role during the initial feedforward response of neurons within the first 10-30 ms after
response onset (Thorpe and Imbert, 1989; Thorpe and Fabre-Thorpe, 2001; Keysers et al., 2001; Rolls,
2004). They could nevertheless be easily added in future work. Furthermore, a certain level of redundancy
is desirable, to handle noise and loss of neurons.

Matching pursuit, which could also be implemented in the visual cortex via horizontal connections, has
been proposed to reduce the redundancy and increase the sparseness of neuronal responses (Perrinet et
al., 2004).

Previous work has already shown how selectivity to orientation could emerge naturally with simple
learning rules like Spike-Timing-Dependent Plasticity (STDP) (Delorme et al., 2001; Guyonneau, 2006) and
a Hebbian rule (Einhauser et al., 2002). The goal of the work here is to apply such rules within a
specific implementation of a model of the ventral stream of the visual cortex (Riesenhuber and Poggio,
1999; Serre et al., 2005a, 2007), formerly known as HMAX.

The wiring of complex units, on the other hand, may reflect learning from visual experience how to
associate frequent transformations in time (such as translation and scale) of specific image features
coded by simple cells. The wiring of the C units reflects learning of correlations across time, e.g., for
complex C1 units, learning which afferent S1 units with the same orientation and neighboring locations
should be wired together because, often, such a pattern changes smoothly in time (under translation)
(Foldiak, 1991; Wiskott and Sejnowski, 2002).

As discussed earlier, the goal of the complex units is to increase the invariance of the representation
along one stimulus dimension. This is done by combining the activity of a group of neighboring simple
units tuned to the same preferred stimulus at slightly different positions and scales. In this work we
focus on translation invariance, but the same mechanism, in principle, could be applied to any
transformation (e.g., scale, rotation or view-point).

A key question is how a complex cell would 'know' which simple cells it should connect to, i.e., which
simple cells represent the same object at different locations. Note that a standard Hebbian rule, which
learns conjunctions of inputs, does not work here, as only one (or a few) of the targeted simple cells
will be activated at once. Instead, a learning rule is needed to learn disjunctions of inputs.

Figure 3: Overview of the specific implementation of the Hubel & Wiesel V1 model used. LGN-like ON- and
OFF-center units are modeled by Difference-of-Gaussian (DoG) filters. Simple units (denoted S1) sample
their inputs from a 7 × 7 grid of LGN-type afferent units. Simple S1 units are organized in cortical
hypercolumns (4 × 4 grid, 3 pixels apart, 16 S1 units per hypercolumn). At the next stage, 4 complex C1
units receive inputs from these 4 × 4 × 16 S1 units. This paper focuses on the learning of the S1 to C1
connectivity.


Several authors have proposed to use temporal continuity to learn complex cells from transformation
sequences (Perrett et al., 1984; Foldiak, 1991; Hietanen et al., 1992; Wallis et al., 1993; Wachsmuth et
al., 1994; Wallis and Rolls, 1997; Rolls and Milward, 2000; Wiskott and Sejnowski, 2002). This can be
done using associative learning rules that incorporate a temporal trace of activity in the post-synaptic
neuron (Foldiak, 1991), exploiting the fact that objects seldom appear or disappear, but are often
translated in the visual field. Hence simple units that are activated in close temporal proximity are
likely to represent the same object, presumably at different locations. Foldiak (1991) proposed a
modified Hebbian rule, known as the 'trace rule', which constrains synapses to be reinforced when strong
inputs coincide with strong average past activity (instead of strong current activity in the case of a
standard Hebbian rule). This proposal has formed the basis of a large number of algorithms for learning
invariances from sequences of images (Becker and Hinton, 1992; Stone and Bray, 1995; Wallis and Rolls,
1997; Bartlett and Sejnowski, 1998; Stringer and Rolls, 2000; Rolls and Milward, 2000; Wiskott and
Sejnowski, 2002; Einhauser et al., 2002; Spratling, 2005).
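
The difference between a standard Hebbian rule and the trace rule can be illustrated on a toy sequence. The sketch below is an illustration under stated assumptions rather than the implementation used in this work: it uses a commonly cited form of the trace rule, in which the weight change is proportional to a running average of the postsynaptic activity, and the function name `train_complex_unit`, the parameters `alpha` and `delta`, and the one-hot 'drifting bar' inputs are hypothetical.

```python
import numpy as np

def train_complex_unit(sequence, w0, alpha=0.1, delta=0.2):
    """Train one unit on a sequence of input frames.

    trace(t) = (1 - delta) * trace(t-1) + delta * y(t)
    w       += alpha * trace(t) * (x(t) - w)
    Setting delta = 1 makes the trace equal to the current activity,
    i.e. a standard (non-trace) Hebbian update of the same form.
    """
    w = np.array(w0, dtype=float)
    trace = 0.0
    for x in sequence:
        y = float(w @ x)                          # current postsynaptic activity
        trace = (1.0 - delta) * trace + delta * y  # running average of activity
        w += alpha * trace * (x - w)
    return w

# A feature drifting across four positions: one input active per frame.
sweep = np.eye(4)
w0 = [0.5, 0.0, 0.0, 0.0]     # the unit initially responds to position 0 only

hebb_w = train_complex_unit(sweep, w0, delta=1.0)   # standard Hebbian
trace_w = train_complex_unit(sweep, w0, delta=0.2)  # trace rule

print(hebb_w)    # only the first weight is non-zero
print(trace_w)   # all four positions have acquired a positive weight
```

Because only one input is active per frame, the standard rule never strengthens the synapses from positions 1 to 3: the unit is silent whenever they are active. With the trace, the activity evoked at position 0 persists long enough to reinforce the inputs that become active in the following frames, which is how a disjunction of inputs can be learned.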

However, as pointed out by Spratling (2005), the trace rule by itself is inappropriate when multiple
objects are present in a scene: it cannot distinguish which input corresponds to which object, and it may
end up combining multiple objects in the same representation. Hence most trace-rule based algorithms
require stimuli to be presented in isolation (Foldiak, 1991; Oram and Foldiak, 1996; Wallis, 1996;
Stringer and Rolls, 2000), and would fail to learn from cluttered natural input sequences.

To solve this problem, Spratling made the hypothesis that the same object could not activate two distinct
inputs, hence co-active units necessarily correspond to distinct objects. He proposed a learning rule
that can exploit this information, and successfully applied it to drifting bar sequences (Spratling,
2005).

However, the 'one object activates one input' hypothesis is a strong
one. It seems incompatible with the redundancy observed in the
mammalian brain and reproduced in our model. Instead we propose
another hypothesis: from one frame to another, the most active inputs
are likely to represent the same object. If the hypothesis is true,
then by restricting the reinforcement to the most active inputs we
usually avoid combining different objects in the same representation
(note that this idea was already present in (Einhäuser et al., 2002),
although not formulated in those terms).

In this work we focus on the learning of simple S1 and complex C1
units (see Fig. 3), which constitutes a direct implementation of the
Hubel and Wiesel (1962) model of striate cortex (see Box 1). The goal
of a C1 unit is to pool over S1 units with the same preferred
orientation, but with shifted receptive fields. In this context our
hypothesis becomes: 'in a given neighborhood, the dominant
orientation is likely to be the same from one frame to another'. As
our results suggest (see later), this is a reasonable hypothesis,
which leads to appropriate pooling.

Figure 4: Reconstructed S1 preferred stimuli for each one of the
4 x 4 cortical hypercolumns (in this figure the position of the
reconstructions within a cortical column is arbitrary). Most units
show a Gabor-like selectivity similar to what has been previously
reported in the literature (see text).

2  Results 

We tested the proposed learning mechanisms in a 3-layer feedforward
network mimicking the Lateral Geniculate Nucleus (LGN) and V1 (see
Fig. 3). Details of the implementation can be found in Section 4.

The stimuli we used were provided by Betsch et al. (2004). The videos
were captured by CCD cameras attached to a cat's head, while the
animal was exploring several outdoor environments. These videos
approximate the input to which the visual system is naturally
exposed, although eye movements are not taken into account.

To simplify the computations, learning was done in two phases. First,
S1 units learned their selectivity through competitive Hebbian
learning. After convergence, plasticity at the S1 stage was switched
off and learning at the complex C1 unit level started. In a more
realistic scenario, this two-phase learning scheme could be
approximated with a slow time constant for learning at the S1 stage
and a faster time constant at the C1 stage.
(a) S1 units (n=73) that remain connected to C1 unit #1 after learning
(b) S1 units (n=35) that remain connected to C1 unit #2 after learning
(c) S1 units (n=59) that remain connected to C1 unit #3 after learning
(d) S1 units (n=38) that remain connected to C1 unit #4 after learning

Figure 5: Pools of S1 units connected to each C1 unit. For example,
C1 unit #1 became selective for horizontal bars: after learning, only
73 S1 units (out of 256) remain connected to the C1 unit, and they are
all tuned to a horizontal bar, but at different positions
(corresponding to different cortical columns; in this figure the
positions of the reconstructions correspond to their positions in
Fig. 4).


2.1  Simple  cells 

After about 9 hours of simulated time, S1 units have learned a
Gabor-like selectivity (see Fig. 4) similar to what has been
previously reported for cortical cells (Hubel and Wiesel, 1959, 1962,
1965, 1968; Schiller et al., 1976a,b,c; De Valois et al., 1982a,b;
Jones and Palmer, 1987; Ringach, 2002). In particular, receptive
fields are localized and tuned to specific spatial frequencies in a
given orientation. In this experiment, only four dominant
orientations emerged, spanning the full range of orientations in 45°
increments: 0°, 45°, 90° and 135°. Interestingly, in another
experiment using S1 receptive fields larger than the 7 x 7 receptive
field sizes used here, we found instead a continuum of orientations.
The fact that we obtain only four orientations here is likely to be a
discretization artifact. With this caveat in mind, in the following
we used the 7 x 7 RF sizes (see Table 1), which match the receptive
field sizes of cat LGN cells.

Our results are in line with previous studies that have shown that
competitive Hebbian learning with DoG inputs leads to Gabor-like
selectivity (see for instance (Delorme et al., 2001; Einhäuser et al.,
2002; Guyonneau, 2006), and (Olshausen and Field, 1996) for a more
sophisticated model).




2.2  Complex  cells 

In phase 2, to learn the receptive fields of the C1 units, we turned
off learning at the S1 stage and began to learn the S1-C1
connectivity. This was done using a learning rule that reinforces the
synapse between the currently most activated S1 unit and the
previously most activated C1 unit (see Section 4.4). After 19 hours
of simulated time, we ended up with binary S1-C1 weights, and each C1
unit remained connected to a pool of S1 units with the same preferred
orientation, possibly in different cortical columns (see Fig. 5).
Hence, by taking the (soft) maximum response among its pool, a C1
unit becomes shift-invariant and inherits its orientation selectivity
from its input S1 units.

In total 38 S1 units were not selected by any C1 unit (see Fig. 6).
They either had an atypical preferred stimulus or were tuned to a
horizontal bar, which, because of a possible bias in the training
data (maybe due to horizontal head movements), is over-represented at
the S1 level. In addition, we did not find any S1 unit selected by
more than one C1 unit. In other words, the pools in Fig. 5(a), 5(b),
5(c), 5(d) and Fig. 6 were all disjoint. Note that superimposing
those 5 figures leads to Fig. 4.

We also experimented with other learning rules within the same
architecture. We reimplemented Földiák's original trace rule
(Földiák, 1991) (see Section 4, Eq. 12). As expected, the learning
rule failed, mainly because input frames do not contain isolated
edges but instead edges with multiple orientations. This, in turn,
leads to complex units that pool over multiple orientations.

We also implemented Spratling's learning rule (Spratling, 2005),
which failed in a similar way, because the hypothesis that 'one edge
activates one S1 unit' is violated here.

Finally we reimplemented the rule by Einhäuser et al. (2002) (see
Section 4, Eq. 10). We reproduced their main results, and the
learning rule generated a continuum of S1-C1 weights (as opposed to
our binary weights). The strongest synapses of a given complex cell
did correspond to simple cells with the same preferred orientation
but, undesirably, the complex cells had also formed connections to
other simple cells with distinct preferred orientations (see also
Section 3).

3  Discussion 

Contrary to most previous approaches (Földiák, 1990, 1991; Wallis and
Rolls, 1997; Stringer and Rolls, 2000; Rolls and Milward, 2000;
Spratling, 2005), our approach deals with natural image sequences, as
opposed to artificial stimuli such as drifting bars. For any given
algorithm to be plausible, a necessary (although not sufficient)
condition is that it can handle natural images, which bring
additional difficulties such as noise, clutter and absence of
relevant stimuli. Models that process simpler stimuli may be useful
to illustrate a given mechanism, but the ultimate goal should be to
deal with natural images, just like humans do. To our knowledge, the
only model for the learning of simple and complex cells that has been
shown to work on natural image sequences is the one by Einhäuser
et al. (2002). Our work extends the study by Einhäuser et al. (2002)
in several significant ways.

Figure 6: The 38 S1 cells that were not connected to any C1.

First, by using soft bounds in the weight update rule (see Equation
9), the proposed algorithm converges towards input weights to a
complex unit that are binary. This means that each complex unit is
strongly connected to a pool of simple units that all have the same
preferred orientation. This leads to complex units with an
orientation bandwidth similar to the orientation bandwidth of simple
units (see (Serre and Riesenhuber, 2004)), in agreement with
experimental data (De Valois et al., 1982b). Conversely, the
algorithm by Einhäuser et al. generates a continuum of synaptic
weights and, although weaker, some of the connections to simple units
with non-preferred orientations remain, thus broadening the
orientation bandwidth from simple to complex units.

Another important difference with the learning rule used in
(Einhäuser et al., 2002) is that our modified Hebbian learning rule
is based on the correlation between the current inputs to a complex
unit and its output at the previous time step (as opposed to previous
input and current output in (Einhäuser et al., 2002)). This was
suggested in (Rolls and Milward, 2000). Here we found empirically
that it leads to faster and more robust learning. It also turns out
to be easier to implement in biophysical circuits: because of
synaptic delays, it is in fact very natural to consider the current
input to a unit and its output at the previous frame, a few tens of
milliseconds earlier. Measuring correlations between past inputs and
current output would need an additional mechanism to store the
current input for future use.

Finally, our approach tends to be simpler than most of the previous
ones. The inputs to the model are raw gray-level images without any
pre-processing such as low-pass filtering or whitening. Also, the
proposed algorithm does not require any weight normalization, and all
the learning rules used are local.

Our neurophysiologically plausible approach also contrasts with
objective function approaches, which optimize a given function, such
as sparseness, i.e., minimizing the number of units active for any
input (Olshausen and Field, 1996; Rehn and Sommer, 2007), statistical
independence (Bell and Sejnowski, 1997; van Hateren and Ruderman,
1998; van Hateren and van der Schaaf, 1998; Hyvärinen and Hoyer,
2001), or even temporal continuity and slowness (Wiskott and
Sejnowski, 2002; Körding et al., 2004; Berkes and Wiskott, 2005), in
a non-biologically plausible way. Such normative models can provide
insights as to why receptive fields look the way they do. Indeed such
models have made quantitative predictions, which have been compared
to neural data (see (van Hateren and Ruderman, 1998; van Hateren and
van der Schaaf, 1998; Ringach, 2002) for instance). However, such
approaches ignore the computational constraints imposed by the
environment and are agnostic about how such learning could be
implemented in the cortex.

Fortunately some of them lead to reasonable rules (e.g., (Olshausen
and Field, 1996; Sprekeler et al., 2007)) and connections can be
drawn between the two classes of approaches. For example, Sprekeler
et al. recently showed that Slow Feature Analysis (SFA) is in fact
equivalent to the trace rule, and could be implemented by Spike
Timing Dependent Plasticity (Sprekeler et al., 2007).

Finally, our approach constitutes a plausibility proof for most
models of the visual cortex (Fukushima, 1980; Riesenhuber and Poggio,
1999; Ullman et al., 2002; Serre et al., 2007; Masquelier and Thorpe,
2007), which typically learn the tuning of units at one location and
simply 'replicate' the tuning of units at all locations. This is not
the approach we undertook in this work: the 16 S1 units at each
location of the 4 x 4 grid are all learned independently and indeed
are not identical. We then suggested a mechanism to pool together
cells with similar preferred stimuli. The success of our approach
validates the simplifying assumption of weight-sharing. However, we
still have to test the proposed mechanisms for higher-order neurons.

The idea of exploiting temporal continuity to build invariant
representations finds partial support from psychophysical studies,
which have suggested that human observers tend to associate together
successively presented views of paperclip objects (Sinha and Poggio,
1996) or faces (Wallis and Bülthoff, 2001). The idea also seems
consistent, as pointed out by Stryker (Stryker, 1991; Földiák, 1998;
Giese and Poggio, 2003), with an electrophysiological study by
Miyashita (1988), who showed that training a monkey with a fixed
sequence of image patterns led to correlated activity between those
same patterns during the delay activity.

Figure 7: Videos: the world from a cat's perspective (Betsch et al.,
2004).

Finally, this class of algorithms leads to an interesting prediction
made by Einhäuser et al. (2002), namely that the selectivity of
complex units could be impaired by rearing an animal in an
environment in which temporal continuity would be disrupted (for
instance using a stroboscopic light or constantly flashing
uncorrelated pictures). We verified this prediction on our model and
found that randomly shuffling the frames of the videos had no impact
on the development of the simple S1 units, while the selectivity of
the complex C1 units was significantly impaired (all the synapses
between the simple and complex units ended up depressed).

To conclude, although this study could be pushed further (in
particular, the proposed mechanism should be implemented on spiking
neurons and tested on higher-order neurons), it does constitute a
plausibility proof that invariances could be learned using a simple
trace rule, even in natural cluttered environments.

4  Methods 

4.1  Stimuli:  the  world  from  a  cat's  perspective 

The videos used were taken from (Betsch et al., 2004) (see Fig. 7).
The camera spans a visual angle of 71° by 53° and its resolution is
320 x 240 pixels. Hence each pixel corresponds to about 13 min of
arc. We only used the first six videos (out of thirteen), for a total
duration of about 11 minutes. Spatio-temporal patches were extracted
from these videos at fixed points from a 9 x 11 grid (sampled every
25 pixels). These 99 sequences were concatenated, leading to a total
of about 19 hours of video (about 1.6 million frames).
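The figures quoted above can be cross-checked with a few lines of
arithmetic. The sketch below is only a sanity check, not part of the
model. The frame rate `FPS` is an assumption (it is not stated in this
section); the exact frame count is the one quoted in Section 4.3.

```python
# Quantities quoted in the text.
H_FOV_DEG, V_FOV_DEG = 71.0, 53.0   # camera field of view
WIDTH_PX, HEIGHT_PX = 320, 240      # camera resolution
GRID_ROWS, GRID_COLS = 9, 11        # patch sampling grid
TOTAL_FRAMES = 1_683_891            # exact count quoted in Section 4.3

# ASSUMPTION: 25 frames per second (not given in this section).
FPS = 25.0

# Angular size of one pixel, in minutes of arc.
arcmin_per_pixel_h = H_FOV_DEG / WIDTH_PX * 60.0
arcmin_per_pixel_v = V_FOV_DEG / HEIGHT_PX * 60.0

# Number of concatenated sequences and their length.
n_sequences = GRID_ROWS * GRID_COLS
frames_per_sequence = TOTAL_FRAMES / n_sequences
minutes_per_sequence = frames_per_sequence / FPS / 60.0
total_hours = TOTAL_FRAMES / FPS / 3600.0

print(f"{arcmin_per_pixel_h:.1f} x {arcmin_per_pixel_v:.1f} arcmin/pixel")
print(f"{n_sequences} sequences, {minutes_per_sequence:.1f} min each")
print(f"{total_hours:.1f} hours in total")
```

Under the assumed frame rate, the per-sequence duration and the total
duration come out close to the "about 11 minutes" and "about 19 hours"
stated in the text.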

In the following, we set the receptive field sizes for model
LGN-like, simple S1 and complex C1 units to the average values
reported in the literature for foveal cells in the cat visual cortex
(Hubel and Wiesel, 1968). We did not model the increase in RF size
with eccentricity and assumed that foveal values held everywhere.
This leads to receptive field sizes for the three layers that are
summarized in Table 1.


Table 1: Receptive field sizes in pixels, and in degrees of visual
angle.

            ON-OFF    S1     C1
Pixels        7       13     22
Degrees      1.6      2.9    4.9
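The entries of Table 1 are consistent with the grid geometry described
in Fig. 3. The following sketch reproduces them. It assumes one
LGN-like unit per pixel (`LGN_SPACING_PX = 1`), which is not stated
explicitly in the text; the other numbers are taken from the paper.

```python
DEG_PER_PIXEL = 71.0 / 320.0   # horizontal field of view / image width

LGN_RF_PX = 7                  # size of the DoG filter
LGN_SPACING_PX = 1             # ASSUMPTION: one LGN-like unit per pixel
S1_GRID = 7                    # S1 samples a 7 x 7 grid of LGN units
COLUMN_GRID = 4                # C1 pools over a 4 x 4 grid of columns
COLUMN_SPACING_PX = 3          # distance between neighboring columns

# An S1 unit sees the union of the receptive fields of its afferents.
s1_rf_px = LGN_RF_PX + (S1_GRID - 1) * LGN_SPACING_PX

# A C1 unit sees the union of the S1 receptive fields in all columns.
c1_rf_px = s1_rf_px + (COLUMN_GRID - 1) * COLUMN_SPACING_PX

sizes_px = {"ON-OFF": LGN_RF_PX, "S1": s1_rf_px, "C1": c1_rf_px}
sizes_deg = {name: round(px * DEG_PER_PIXEL, 1)
             for name, px in sizes_px.items()}

print(sizes_px)
print(sizes_deg)
```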

4.2  LGN  ON-  and  OFF-center  unit  layer 

Gray-level images are first analyzed by an array of LGN-like units
that correspond to 7 x 7 Difference-of-Gaussian (DoG) filters:

$$\mathrm{DoG} = \frac{1}{2\pi}\left(\frac{1}{\sigma_1^2}\, e^{-\frac{r^2}{2\sigma_1^2}} - \frac{1}{\sigma_2^2}\, e^{-\frac{r^2}{2\sigma_2^2}}\right) \qquad (1)$$

where $r$ is the distance to the center of the filter. We used
$\sigma_2 = 1.4$ and $\sigma_2/\sigma_1 = 1.6$ to make the DoG
receptive fields approximate a Laplacian filter profile, which in
turn resembles the receptive fields of biological retinal ganglion
cells (Marr and Hildreth, 1980). Positive values were assigned to the
ON-center cell map, and the absolute values of negative values to the
OFF-center cell map.
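As an illustration of this stage, the sketch below builds a 7 x 7 DoG
kernel and splits the filtered image into ON and OFF maps. It is a
minimal sketch, not the authors' code: the function names `dog_kernel`
and `lgn_maps` are hypothetical, the kernel assumes the standard
center-minus-surround form with a narrow center ($\sigma_2/1.6$) and a
wide surround ($\sigma_2 = 1.4$), and borders are handled by keeping
only positions where the kernel fits entirely in the image (an
assumption).

```python
import numpy as np


def dog_kernel(size=7, sigma_center=1.4 / 1.6, sigma_surround=1.4):
    """Difference-of-Gaussians kernel (center minus surround)."""
    half = size // 2
    ys, xs = np.mgrid[-half:half + 1, -half:half + 1]
    r2 = xs ** 2 + ys ** 2
    center = np.exp(-r2 / (2 * sigma_center ** 2)) / sigma_center ** 2
    surround = np.exp(-r2 / (2 * sigma_surround ** 2)) / sigma_surround ** 2
    return (center - surround) / (2 * np.pi)


def lgn_maps(image, kernel):
    """Filter a gray-level image and split the result into ON/OFF maps."""
    # All kernel-sized windows of the image ("valid" positions only).
    windows = np.lib.stride_tricks.sliding_window_view(image, kernel.shape)
    # The kernel is symmetric, so correlation equals convolution here.
    response = np.einsum("ijkl,kl->ij", windows, kernel)
    on_map = np.maximum(response, 0.0)     # positive part
    off_map = np.maximum(-response, 0.0)   # absolute value of negative part
    return on_map, off_map


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    frame = rng.uniform(0, 255, size=(20, 20))
    on_map, off_map = lgn_maps(frame, dog_kernel())
    print(on_map.shape, off_map.shape)
```

At any position at most one of the two maps is non-zero, so the pair
of maps carries the same information as the signed filter output.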


4.3 S1 layer: competitive Hebbian learning

Model S1 units are organized on a 4 x 4 grid of cortical columns.
Each column contains 16 S1 units (see Fig. 3). The distance between
columns was set to 3 pixels (i.e., about half a degree of visual
angle). Each S1 unit receives its inputs from a 7 x 7 grid of
afferent LGN-like units (both ON- and OFF-center), for a total of
7 x 7 x 2 = 98 input units. S1 units perform a bell-shaped TUNING
function (see (Serre et al., 2005a) for details), which can be
approximated by the following static mathematical operation (see
Box 2):


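$$y_{raw} = \frac{\sum_{j=1}^{n} w_j\, x_j^{p}}{k + \left(\sum_{j=1}^{n} x_j^{q}\right)^{r}} \qquad (2)$$

Here for the S1 cells we set the parameters to k = 0, p = 1, q = 2
and r = 1/2, which is exactly a normalized dot-product:

$$y_{raw} = \frac{\mathbf{w} \cdot \mathbf{x}}{\|\mathbf{x}\|} \qquad (3)$$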


The reader should refer to (Knoblich et al., 2007) for biophysical
circuits of integrate-and-fire neurons that use realistic parameters
of synaptic transmission and approximate Eq. 3.

The response of a simple S1 unit is maximal if the input vector x is
collinear with the synaptic weight vector w (i.e., the preferred
stimulus of the unit). As the pattern of input becomes more
dissimilar to the preferred stimulus, the response of the unit
decreases monotonically in a bell-shaped way (i.e., as the cosine of
the angle between the two vectors).
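The operation of Eq. 2 is compact enough to write down directly. The
sketch below is an illustrative implementation under the assumption
that inputs are non-negative (as the ON and OFF maps are); the
function name `tuning` and the convention of returning 0 for an
all-zero input are choices made here, not taken from the paper.

```python
import numpy as np


def tuning(x, w, p=1.0, q=2.0, r=0.5, k=0.0):
    """Eq. 2: weighted sum of x**p divided by k + (sum of x**q)**r.

    The defaults (k=0, p=1, q=2, r=1/2) give the normalized
    dot-product of Eq. 3 used for S1 units.
    """
    x = np.asarray(x, dtype=float)
    w = np.asarray(w, dtype=float)
    denominator = k + np.sum(x ** q) ** r
    if denominator == 0.0:
        return 0.0  # convention chosen here for an all-zero input
    return float(np.sum(w * x ** p) / denominator)


if __name__ == "__main__":
    w = np.array([0.2, 0.9, 0.4])
    print(tuning(3.0 * w, w))                     # input collinear with w
    print(tuning(np.array([0.9, 0.2, 0.4]), w))   # a different pattern
```

With the S1 parameters the response does not depend on the contrast of
the input (scaling x leaves it unchanged) and, by the Cauchy-Schwarz
inequality, it is largest when x is collinear with w.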

The unit activity $y_{raw}$ is further normalized by the recent unit
history, i.e., a running average (denoted by tr(.)) of the raw
activities over the past few frames:

$$y = \frac{y_{raw}}{\mathrm{tr}(y_{raw})} \qquad (4)$$


Such unit history is often referred to as a (memory) trace (Földiák,
1991; Wallis, 1996; Wallis and Rolls, 1997; Stringer and Rolls, 2000;
Rolls and Milward, 2000). For our model S1 unit, such normalization
by the trace approximates adaptation effects. One can think of
$y_{raw}$ as the membrane potential of the unit, while $y$
approximates the instantaneous firing rate of the unit over short
time intervals: units that have been strongly active will become less
responsive. While non-critical, this normalization by the trace
significantly speeds up the convergence of the learning algorithm by
balancing the activity between all S1 units (the response of units
with a record of high recent activity is reduced while the response
of units which have not been active in the recent past is enhanced)
(see Note 3).

The weights w of all the S1 units were initialized at random (sampled
from a uniform distribution on the [0,1] interval). In each cortical
hypercolumn only the most active cell is allowed to fire
(1-Winner-Take-All mechanism). However, it will do so if and only if
its activity reaches its threshold T. It will then trigger the
(modified) Hebbian rule:

$$\Delta \mathbf{w} = \alpha \cdot y \cdot (\mathbf{x} - \mathbf{w}) \qquad (5)$$

The $-\mathbf{w}$ term, added to the standard Hebb rule, keeps w
bounded. However, the learning rule is still fully local.

The winner then updates its threshold as follows:

$$T = y \qquad (6)$$

At each time step, all thresholds are decreased as follows:

$$T = (1 - \eta) \cdot T \qquad (7)$$

There is experimental evidence for such threshold modulations in
pyramidal neurons, which contribute to homeostatic regulation of
firing rates. For example, Desai et al. showed that depriving neurons
of activity for two days increased sensitivity to current injection
(Desai et al., 1999).

At each time step the traces are updated as follows:

$$\mathrm{tr}(y_{raw}) = \frac{1}{\nu} \cdot y_{raw} + \left(1 - \frac{1}{\nu}\right) \cdot \mathrm{tr}(y_{raw}) \qquad (8)$$

We used $\eta = 2^{-15}$ and $\nu = 100$. It was found useful to
geometrically increase the learning rate $\alpha$ for each S1 cell
every 10 weight updates, starting from an initial value of 0.01 and
ending at 0.1 after 200 weight updates. Only half of the 1,683,891
frames were needed to reach convergence.
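The steps of this section (normalized dot-product, normalization by
the trace, winner-take-all with a threshold, Hebbian update, threshold
decay and trace update) can be put together for a single hypercolumn
as follows. This is a sketch under stated assumptions, not the
authors' implementation: the class name `S1Column` is hypothetical,
thresholds are assumed to start at 0 and traces at 1, and the order of
the operations inside one time step is a guess where the text leaves
it open.

```python
import numpy as np


class S1Column:
    """One hypercolumn of competing S1 units (illustrative sketch)."""

    def __init__(self, n_units=16, n_inputs=98, eta=2.0 ** -15,
                 nu=100.0, seed=0):
        rng = np.random.default_rng(seed)
        self.w = rng.uniform(0.0, 1.0, size=(n_units, n_inputs))
        self.threshold = np.zeros(n_units)   # ASSUMPTION: start at 0
        self.trace = np.ones(n_units)        # ASSUMPTION: start at 1
        self.n_updates = np.zeros(n_units, dtype=int)
        self.eta = eta
        self.nu = nu

    def learning_rate(self, unit):
        """Geometric schedule: 0.01 -> 0.1 over 200 weight updates."""
        steps = min(int(self.n_updates[unit]), 200) // 10
        return 0.01 * 10.0 ** (steps / 20.0)

    def step(self, x):
        """Process one input vector; return the winner or None."""
        x = np.asarray(x, dtype=float)
        norm = np.linalg.norm(x)
        if norm > 0:
            y_raw = self.w @ x / norm                 # Eq. 3
        else:
            y_raw = np.zeros(len(self.w))
        y = y_raw / self.trace                        # Eq. 4
        winner = int(np.argmax(y))                    # 1-Winner-Take-All
        fired = norm > 0 and y[winner] >= self.threshold[winner]
        if fired:
            alpha = self.learning_rate(winner)
            self.w[winner] += alpha * y[winner] * (x - self.w[winner])  # Eq. 5
            self.threshold[winner] = y[winner]        # Eq. 6
            self.n_updates[winner] += 1
        self.threshold *= 1.0 - self.eta              # Eq. 7
        self.trace = y_raw / self.nu + (1.0 - 1.0 / self.nu) * self.trace  # Eq. 8
        return winner if fired else None


if __name__ == "__main__":
    column = S1Column(seed=1)
    x = np.random.default_rng(2).uniform(0.0, 1.0, size=98)
    print(column.step(x))
```

Each call to `step` moves only the winner's weight vector towards the
current input, so repeated presentation of natural patches lets the
units of a column specialize on different patterns.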


4.4 C1 layer: pool together consecutive winners

4 C1 cells receive inputs from the 4 x 4 x 16 S1 cells through
synapses with weight $w \in [0, 1]$ (initially set to 0.75).

Each C1 cell's activity is computed using Eq. 2, but this time with
p = 6 (and still q = 2 and r = 1/2). It has been shown that such an
operation performs a SOFT-MAX (Yu et al., 2002), and biophysical
circuits to implement it have been proposed in (Knoblich et al.,
2007).

Winner-Take-All mechanisms select the C1 winner at time
$t - \Delta t$ (previous frame), $J_{t-\Delta t}$, and the current S1
winner at time $t$ (current frame), $I_t$. The synapse between them
is reinforced, while all the other synapses of $J_{t-\Delta t}$ are
depressed:

$$\Delta w_{i J_{t-\Delta t}} = \begin{cases} a^{+} \cdot w_{i J_{t-\Delta t}} \cdot (1 - w_{i J_{t-\Delta t}}) & \text{if } i = I_t \\ a^{-} \cdot w_{i J_{t-\Delta t}} \cdot (1 - w_{i J_{t-\Delta t}}) & \text{otherwise.} \end{cases} \qquad (9)$$

Synaptic weights for the non-winning C1 cells are unchanged.

This learning rule was inspired by previous work on Spike Timing
Dependent Plasticity (STDP) (Masquelier and Thorpe, 2007). The
multiplicative term $w_{i J_{t-\Delta t}} \cdot (1 - w_{i J_{t-\Delta t}})$
ensures the weight remains in the range [0,1] (excitatory synapses)
and implements a soft bound effect: when the weight approaches a
bound, weight changes tend toward zero, while the most plastic
synapses are those in an intermediate state.

As recommended by (Rolls and Milward, 2000) we chose to exploit
correlations between the previous output and the current input (as
opposed to current output and previous input, as in Einhäuser et al.
(2002)). We empirically confirmed that learning was indeed more
robust this way.

It was found useful to geometrically increase the learning rates
every 1000 iterations, while maintaining the $a^{+}/a^{-}$ ratio at a
constant value (-170). We started with $a^{+} = 2^{-3}$ and set the
increase factor so as to reach $a^{+} = 2^{-1}$ at the end of the
simulation.


4.5 Main differences with Einhäuser et al. 2002

• Learning rule for complex cells: first, Einhäuser et al. select the
  C1 winner at time $t$ (current frame), $J_t$, and the previous S1
  winner at time $t - \Delta t$ (previous frame), $I_{t-\Delta t}$.
  The synapse between them is reinforced, while all the other
  synapses of $J_t$ are depressed. Second, Einhäuser et al. (2002)
  use a different weight update rule:

  $$\Delta w_{i J_t} = \begin{cases} \alpha \cdot (1 - w_{i J_t}) & \text{if } i = I_{t-\Delta t} \\ -\alpha \cdot w_{i J_t} & \text{otherwise.} \end{cases} \qquad (10)$$

  This learning rule leads to a continuum of weights at the end (as
  opposed to binary weights). Tests have shown that the problem
  persists if (like us) we select the C1 winner at time
  $t - \Delta t$ (previous frame), $J_{t-\Delta t}$, and the current
  S1 winner at time $t$ (current frame), $I_t$, and apply the rule by
  Einhäuser et al. (2002):

  $$\Delta w_{i J_{t-\Delta t}} = \begin{cases} \alpha \cdot (1 - w_{i J_{t-\Delta t}}) & \text{if } i = I_t \\ -\alpha \cdot w_{i J_{t-\Delta t}} & \text{otherwise.} \end{cases} \qquad (11)$$

  This suggests that the problem comes from the weight update rule
  that was used in (Einhäuser et al., 2002), and not from the type of
  correlation involved.

• In the model by (Einhäuser et al., 2002) the activity of the units
  is normalized by the "trace" (see earlier) also at the complex cell
  level.

• The model by (Einhäuser et al., 2002) is similar to an energy-type
  model of complex cells, such that simple and complex cells have
  identical receptive field sizes. In particular, the model does not
  account for the increase in RF sizes between S1 and C1 units
  (typically doubling (Hubel and Wiesel, 1962)).
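The difference between the soft-bound rule (Eq. 9) and the update
rule of Eq. 11 can be illustrated on a toy problem. The sketch below
is not a reproduction of the experiments: it follows the weights of a
single C1 unit that is assumed to be the previous winner at every
step, uses fixed learning rates (no geometric schedule), and draws the
current S1 winner from a made-up distribution in which four
"preferred" inputs win often and four "clutter" inputs win rarely. The
function names and the value `alpha=0.01` are hypothetical.

```python
import numpy as np


def soft_bound_update(w, s1_winner, a_plus=2.0 ** -3, ratio=-170.0):
    """Eq. 9: potentiate the winner, depress the others, soft bounds."""
    a_minus = a_plus / ratio           # negative, since the ratio is -170
    rates = np.full(w.shape, a_minus)
    rates[s1_winner] = a_plus
    return w + rates * w * (1.0 - w)


def einhauser_update(w, s1_winner, alpha=0.01):
    """Eq. 11: the winner moves towards 1, the others decay towards 0."""
    dw = -alpha * w
    dw[s1_winner] = alpha * (1.0 - w[s1_winner])
    return w + dw


def simulate(update, win_probs, n_steps=20000, w0=0.75, seed=0):
    """Apply `update` for a random sequence of S1 winners."""
    rng = np.random.default_rng(seed)
    w = np.full(len(win_probs), w0)
    winners = rng.choice(len(win_probs), size=n_steps, p=win_probs)
    for s1_winner in winners:
        w = update(w, s1_winner)
    return w


if __name__ == "__main__":
    # Hypothetical win frequencies: 4 preferred inputs, 4 clutter inputs.
    probs = [0.248] * 4 + [0.002] * 4
    print(np.round(simulate(soft_bound_update, probs), 3))
    print(np.round(simulate(einhauser_update, probs), 3))
```

In this toy setting the soft-bound rule drives every weight close to
0 or 1, whereas the rule of Eq. 11 leaves each weight near the
frequency with which its input wins, which is the kind of continuum
described above.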

4.6 Földiák's original trace rule

We also reimplemented Földiák's original trace rule (Földiák, 1991)
(see Section 2), which is given by:

$$\Delta \mathbf{w} = \alpha \cdot \mathrm{tr}(y) \cdot (\mathbf{x} - \mathbf{w}) \qquad (12)$$
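A minimal sketch of one step of Eq. 12 is given below. The text does
not specify how the trace is computed for this rule; the sketch
assumes an exponential running average of the same form as Eq. 8, with
a hypothetical decay parameter `delta`, and the function name
`trace_rule_step` is likewise an illustration only.

```python
import numpy as np


def trace_rule_step(w, x, y, trace, alpha=0.02, delta=0.2):
    """One step of the trace rule (Eq. 12).

    ASSUMPTION: the trace is an exponential running average of the
    unit's activity y, updated before the weights.
    """
    trace = (1.0 - delta) * trace + delta * y
    w = w + alpha * trace * (x - w)
    return w, trace


if __name__ == "__main__":
    w, trace = np.zeros(3), 0.0
    # Frame 1: the unit is active while input 0 is on.
    w, trace = trace_rule_step(w, np.array([1.0, 0.0, 0.0]), 1.0, trace)
    # Frame 2: the unit is silent, but input 1 is on.
    w, trace = trace_rule_step(w, np.array([0.0, 1.0, 0.0]), 0.0, trace)
    print(w, trace)
```

Because the trace is still non-zero at the second frame, the weight of
input 1 grows even though the unit itself is silent. This is what
binds successive views together, and also why any input that happens
to follow, including clutter, can be pooled.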


Notes 

1. As suggested by several authors (Földiák and Young, 1995; Perrett
et al., 1998; Shadlen and Newsome, 1998; Keysers et al., 2001; Serre
et al., 2005a), because of the strong temporal constraints imposed on
the cortical circuits (i.e., computations at each stage have to be
performed within very small temporal windows of 10-30 ms (Thorpe and
Imbert, 1989; Thorpe and Fabre-Thorpe, 2001; Keysers et al., 2001;
Rolls, 2004) under which single neurons can transmit only very few
spikes), the basic units of processing are likely to be modules of
hundreds of neurons with similar selectivities rather than individual
cells. Such computational modules could be implemented by cortical
columns (Mountcastle, 1957, 1997). To paraphrase Mountcastle
(Mountcastle, 1997): "the effective unit of operation in such a
distributed system is not the single neuron and its axon, but groups
of cells with similar functional properties and anatomical
connections".

2. For the past two decades several studies (in V1 for the most part)
have provided evidence for the involvement of GABAergic circuits in
shaping the response of neurons (Sillito, 1984; Douglas and Martin,
1991; Ferster and Miller, 2000). Direct evidence for the existence of
divisive inhibition comes from an intracellular recording study in V1
(Borg-Graham and Frégnac, 1998). Wilson et al. (1994) also showed the
existence of neighboring pairs of pyramidal cells / fast-spiking
interneurons (presumably inhibitory) in the prefrontal cortex with
inverted responses (i.e., phased excitatory/inhibitory responses).
The pyramidal cell could provide the substrate for the weighted sum
while the fast-spiking neuron would provide the normalization term.

3. Empirically we found that the learning algorithm would still
converge without the normalization term. However, the distribution of
preferred orientations among the learned S1 units would be far less
balanced than in the full learning algorithm (the number of
horizontal units would outweigh the number of vertical ones).